Some Experiments on Clustering Similar Sentences of Texts in Portuguese
Abstract
Identifying similar text passages plays an important role in many NLP applications, such as paraphrase generation and automatic summarization. This paper presents some experiments on detecting and clustering similar sentences in Brazilian Portuguese texts. We propose an evaluation framework based on an incremental, unsupervised clustering method combined with statistical similarity metrics that measure the semantic distance between sentences. Experiments show that this method is robust even on small data sets. In the best case, it achieved an F-measure of 86%, a Purity of 93%, and an Entropy of 0.037.
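As a rough illustration of the kind of incremental, unsupervised clustering the abstract describes, the sketch below assigns each incoming sentence to the most similar existing cluster, or opens a new cluster when no similarity reaches a threshold. The abstract does not specify the similarity metrics or parameters used, so the cosine similarity over raw word counts, the 0.5 threshold, and the names `cosine`, `incremental_cluster`, and `purity` are all assumptions made for this example, not the authors' implementation.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors (assumed metric)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def incremental_cluster(sentences, threshold=0.5):
    """Single-pass clustering: each sentence is seen once, in order.

    Returns a list of clusters, each a list of sentence indices.
    The threshold value is hypothetical, chosen only for illustration.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [indices]}
    for i, sentence in enumerate(sentences):
        vec = Counter(sentence.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # Join the closest cluster and fold the sentence into its centroid.
            best["members"].append(i)
            best["centroid"].update(vec)
        else:
            # Nothing is similar enough: start a new cluster.
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]


def purity(clusters, gold):
    """Fraction of sentences that carry the majority gold label of their cluster."""
    total = sum(len(c) for c in clusters)
    hits = sum(max(Counter(gold[i] for i in c).values()) for c in clusters)
    return hits / total
```

Because the pass is incremental, the result can depend on the order in which sentences arrive; Purity, one of the measures reported in the abstract, compares the resulting clusters against a manually labeled reference.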
Similar resources
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on clustering English documents has been carried out. The question now arises: can these results be extended to other languages? Since the goal of document clustering is grouping of docum...
Automatic Multi-Document Arabic Text Summarization Using Clustering and Keyphrase Extraction
Automatic text summarization has become important due to the rapid growth of information texts since it is very difficult for human beings to manually summarize large documents of texts. A full understanding of the document is essential to form an ideal summary. However, achieving full understanding is either difficult or impossible for computers. Therefore, selecting important sentences from t...
Two Corpus Based Experiments with the Portuguese and English Wordnets
This paper presents two experiments with real-world applications of word sense disambiguation, wordnets and dependency parsing. The first is an effort towards a Portuguese wordnet-annotated corpus. We manually annotated 30 sentences using OpenWordNet-PT as a lexicon and then compared the results with an automatic annotation. In addition to the system’s evaluation, the results provided valuable ...
Sentence Alignment of Brazilian Portuguese and English Parallel Texts
Parallel texts – texts in one language and their translations to other languages – are becoming more and more available nowadays on the Web. Aligning these texts means to find some correspondence between them, in sentence level, for instance. In this paper we describe some experiments done with Brazilian Portuguese and English parallel texts using five well known sentence alignment methods. The...
SIMBA: An Extractive Multi-document Summarization System for Portuguese
This is a proposal for a demonstration of simba at PROPOR 2012. simba is an extractive multi-document summarization system that aims at producing generic summaries guided by a compression rate defined by the user. It uses a double-clustering approach to find the relevant information in a set of texts. In addition, simba uses a sentence simplification procedure as a means to ensure summary compress...
Publication date: 2008